Method for evaluating a quality of voice onset of a speaker

ABSTRACT

In a method for evaluating the voice onset of a speaker, especially suited for treatment of a stuttering disorder, the analysis includes the following steps: a. determining a time of voice onset of the speaker; b. obtaining a fundamental frequency at the time of voice onset; c. in a predetermined time interval, obtaining the curve with respect to time of the energy at the fundamental frequency; d. obtaining the curve with respect to time of the energy at one or more harmonic multiples of the fundamental frequency; and e. determining the temporal progression of the ratio of the energies obtained in steps c and d. A gentle voice onset is presumed if the energy ratio is initially dominated by the energy of the fundamental frequency, and only in the further course of the predetermined time interval, in a time span of Δt, the energy ratio shifts in favor of the energy/energies of the harmonic multiple(s) of the fundamental frequency.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to European patent application number 16186498.8 filed Aug. 31, 2016, the contents of which are incorporated herein by reference in their entirety.

TECHNICAL FIELD

The present invention relates to a method for evaluating a quality of a voice onset of a speaker. It further relates to a data processing program for computerized automated evaluation of a quality of voice onset of a speaker and a computerized data processing device having such a data processing program. The present invention is especially suited for the treatment of stuttering disorder.

BACKGROUND

The evaluation of the quality of the voice onset, especially in terms of a so-called hard or gentle voice onset, is important for speech training. This applies especially, but not exclusively, to speech training for stutterers; it also applies to the training of speech therapists, who must learn a deliberate voice onset. Evaluating the quality of a voice onset is important for passing this capability on, as it is, for example, in the training of persons who work a great deal with their speech.

Stuttering is a speech disorder in which involuntary repetitions, prolongations, and blocks occur. Worldwide, around 1% of the population is affected. Stutterers suffer in significant part from social anxieties and exclusion. According to the present state of knowledge, the speech disorder is considered incurable, but by means of speech exercises it can be managed quite well and its effects can be well suppressed. There is thus on the one hand a need to develop and to offer a suitable speech exercise program; on the other hand, from the commercial standpoint there is a considerable market for the corresponding services or products.

One widespread and successful treatment approach is the speech technique approach. Here a new and altered speech mode is learned in which stutter events occur less often. One of the most prominent versions of this approach is Webster's “Precision Fluency Shaping Program” (PFSP), which was developed in the early 1970s and is distinguished by gentle voice onsets, stretching of sounds, and attenuation and voicing of consonants.

Computer-supported feedback on pronunciation is quite often a component of work with stutterers. A German adaptation of the PFSP, named the Kasseler Stuttering Therapy (KST), has been established and found to be highly effective.

A core aspect of fluency shaping and related approaches is the gentle voice onset. With a gentle voice onset, blocks do not occur in the first place. Gentle voice onsets are frequently used by stutterers intuitively in order to dispel blocks. With gentle voice onsets, the tone is produced at the glottis without the glottis first having formed a stop. In this way the voice begins gently and quietly. In contrast to this, a hard voice onset is characterized by a prior glottal stop, which is then released in the form of a plosive-like sound. These differences can be seen, for example, by comparing the electroglottogram with the speech signal, or in high-speed videos of the glottis. Gentle voice onsets are in addition marked by the absence of laryngealization (creaky voice, that is, irregular low-frequency vocal cord vibrations).

Webster's Fluency Shaping and related methods such as KST use computerized automatic analysis of the speech signal in order to provide the subject with feedback on the voice onsets. Existing methods for the recognition of voice onsets and evaluation of their quality use the fact that the gentleness of onset normally is manifested in a gradual increase in sound intensity.

For example, Webster uses the progression of sound intensity over the first 100 ms of voice onset in order to achieve an automatic classification of a gentle or hard voice onset. Other authors have also described the use of the sound level curve (i.e. its temporal evolution) of an utterance in order to automatically evaluate voice onsets. The corresponding descriptions may be found for example in U.S. Pat. No. 4,020,567.

Among a number of parameters that describe the sound level curve, the logarithm of the so-called rise time has been identified as the best one. The rise time (the term used by H. F. Peters et al. in “perceptual judgment of abruptness of voice insertion in vowels as a function of the amplitude envelope,” Journal of Speech and Hearing Disorders, vol. 51, no. 4, pp. 299-308, 1986) is the time between attainment of 10% and 90% of the maximal sound level. Under the very controlled conditions of this study, this parameter correlates very well with the perceptual (graded) estimate of gentleness of voice onset.
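By way of a non-limiting illustration of this prior-art parameter, the logarithm of the rise time can be sketched as follows. The function name, the use of a linear amplitude envelope sampled at a fixed step, and the one-step lower bound on the rise time are assumptions made for this sketch; they are not taken from the cited study.

```python
import math

def log_rise_time(envelope, step_s, lo=0.10, hi=0.90):
    """Log of the time between first reaching 10% and 90% of the peak level.

    envelope: amplitude envelope sampled every step_s seconds (assumed input).
    """
    peak = max(envelope)
    # index of the first sample at or above each fraction of the peak
    i_lo = next(i for i, v in enumerate(envelope) if v >= lo * peak)
    i_hi = next(i for i, v in enumerate(envelope) if v >= hi * peak)
    # clamp to one step so that the logarithm is always defined (assumption)
    rise_s = max(i_hi - i_lo, 1) * step_s
    return math.log(rise_s)
```

A slowly rising envelope yields a larger value than an abruptly rising one, which is the property the prior-art classification relies on.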

US 2012/0116772 A1 describes a client-server system for supporting general speech therapy. The speech of the patient is received on a mobile terminal device and is automatically analyzed. The speech signal is transmitted to a server in order to perform further automatic analyses and to give the therapist the opportunity to intervene in the therapy. The automatic analyses also include analysis of voice onset. Likewise, fluency shaping as a therapy approach for stuttering is mentioned. The extraction of acoustic parameters for automatic analysis is described very broadly and generically, however. For analysis of voice onsets, no further information is given as to how this is done and what parameters are used.

To the extent they are described in any detail, the systems from the prior art that are used for classification of the voice onset all resort solely to the sound level curve. This is problematic for two reasons.

First of all, a gentle and gradual sound level increase is neither sufficient nor necessary for a gentle voice onset. For example, a creaky voice onset can have such a sound level increase. Creakiness, however, is to be avoided in any case in fluency shaping.

On the other hand, a voice onset can be gentle and still rise quickly in sound level. Thus, the sound level is viewed as an insufficient or incomplete parameter for determining and classifying a voice onset as “gentle” or “not gentle.”

Although taking into account only loudness is congruent with the usual helpful advice to produce gentle and gradually intensifying voice onsets, it is still problematic. Under very controlled conditions, parameters from the sound level curve may lead to success. But under realistic conditions, i.e. with non-prototypical voice onsets and with heterogeneous and/or cheaper and therefore less exact audio accessories, the classification task may be markedly more difficult and may not be precisely achieved solely by means of analysis of the sound level curve.

SUMMARY

Against this background, the object of the invention is to provide a method for evaluating the quality of a voice onset in which gentle voice onsets of quite different speakers can be reliably recognized in an automated, especially computerized, process. In addition, a method based on such a method is to be provided for treatment of persons with speech disorders.

This object is achieved by a method for evaluating a quality of a voice onset of a speaker in accordance with the disclosed embodiments. A further aspect of the solution lies in a data processing program with the features disclosed herein. Finally, the invention also provides a computerized data processing device.

In the method according to the invention for evaluating a quality of voice onset of a speaker, an acoustic speech signal of the speaker is accordingly recorded and converted to a digital speech signal. The digital speech signal is then analyzed in its temporal progression in order to

a. in the temporal progression of the digital speech signal, determine the time of voice onset of the speaker,

b. obtain a fundamental frequency of the speech signal at the time of voice onset,

c. from the digital speech signal, in a predetermined time interval from the time of voice onset, extract the curve with respect to time of the energy contained in the speech signal at the fundamental frequency;

d. from the digital speech signal, in the predetermined time interval, obtain the curve with respect to time of the energy contained in the speech signal for at least one harmonic multiple of the fundamental frequency;

e. determine the temporal progression of the ratio of the energies obtained in steps c and d.

With recourse to the energies thus determined under c. and d. and their ratio determined under e., a gentle voice onset is presumed if in the time interval the ratio of energies obtained in e. is first dominated by the energy of the fundamental frequency, and only further on in the predetermined time interval, in a time span of Δt, the ratio of energies shifts in favor of the harmonic multiple(s) of the fundamental frequency.
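By way of a non-limiting illustration, steps c. to e. can be sketched as follows for the second harmonic only. The function names, the von-Hann window of 25 ms with a 10 ms step, and the use of a single constant fundamental frequency `f0` over the whole interval are assumptions made to keep the sketch short; a practical implementation would re-estimate the fundamental frequency for each frame.

```python
import numpy as np

def energy_at(frame, fs, freq_hz):
    """Energy of one windowed frame at a single frequency (one DFT projection)."""
    n = np.arange(len(frame))
    window = np.hanning(len(frame))
    projection = np.sum(frame * window * np.exp(-2j * np.pi * freq_hz * n / fs))
    return np.abs(projection) ** 2

def h2_to_f0_ratio_db(signal, fs, onset_s, f0,
                      interval_s=0.100, win_s=0.025, step_s=0.010):
    """Per-frame ratio (dB) of the energy at 2*f0 to the energy at f0 after onset."""
    start = int(round(onset_s * fs))
    win = int(round(win_s * fs))
    step = int(round(step_s * fs))
    n_frames = int(round(interval_s / step_s))
    ratios = []
    for i in range(n_frames):
        frame = signal[start + i * step: start + i * step + win]
        if len(frame) < win:          # stop when the signal runs out
            break
        e_f0 = energy_at(frame, fs, f0) + 1e-12        # step c
        e_h2 = energy_at(frame, fs, 2.0 * f0) + 1e-12  # step d
        ratios.append(10.0 * np.log10(e_h2 / e_f0))    # step e
    return np.array(ratios)
```

For a gentle onset, the returned values would be expected to start well below 0 dB and to rise as the harmonics set in.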

In other words, the focus here is not on the sound level and its curve (and thus the energy) of the entire speech event during and after the voice onset; instead, different parts of the voice signal are considered in detail. Data recorded and evaluated by the inventors in fact showed that the sound level curve does not result in satisfactory reliability of classification of the desired behavior of gentle onset.

The method according to the invention now uses the realization acquired by the inventors that gentle onsets are above all distinguished by a special voice quality at the start of the speech event. Because the amplitude of vibrations of the vocal cords in a gentle onset at first rises slowly, as the inventors recognized, initially chiefly the fundamental frequency is represented in the speech signal, while its harmonic multiples are scarcely present. The speech signal at the voice onset and in the first subsequent phase is approximately sinusoidal, as is shown in FIG. 1. Only when the vibrations of the vocal cords have reached their maximum is there a periodic vocal cord closure and thus the production of the normal voice with its strong proportions of harmonic multiples.

With a hard voice onset, on the other hand, the release of the prior glottis closure immediately initiates normal voice production with periodic glottal stops and the presence of harmonic multiples of the fundamental frequency. This is shown in FIG. 2. In accordance with the invention, this aspect of voice quality is extracted using suitable acoustic parameters by means of the energy ratio of the oscillation components at the fundamental frequency to the oscillation components at the harmonic multiples over a predetermined time interval. For example, this aspect can be judged with reference to the ratio of energies of the first harmonic (thus the fundamental frequency) and the second harmonic measured in the first 10 ms after voice onset. Thus, within the scope of the invention it is not absolutely necessary to consider the energies of all harmonic multiples of the fundamental frequency. Since normally the lower-order harmonics oscillate with much more energy than the higher-order harmonics, it can suffice to concentrate solely on the lower-order harmonics, for example precisely on the second harmonic (doubled fundamental frequency).

Within the scope of the invention, a summation of energies in the range of 0.5*F₀ to 1.5*F₀ can be assumed to be the energy of the fundamental frequency F₀, in order to take into account a “blurring” of the frequency and of the energy contained therein. The energy thus obtained at the fundamental frequency can be related, for example, to that in a broad frequency band above the fundamental frequency (for example the energy in the range of 50 to 4,000 Hz), in order to then characterize the voice onset from the temporal development of this ratio.

For different examples, the following values of these energy ratios were obtained in experiments:
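By way of a non-limiting illustration, the band-based energy ratio can be sketched for a single analysis frame as follows. The function name, the von-Hann window, the FFT length of 2048, and the reading of the comparison band as the remainder of the 50 to 4,000 Hz range outside 0.5*F₀ to 1.5*F₀ are assumptions of this sketch.

```python
import numpy as np

def f0_band_ratio_db(frame, fs, f0, n_fft=2048, lo_hz=50.0, hi_hz=4000.0):
    """Energy in [0.5*f0, 1.5*f0] relative to the rest of [lo_hz, hi_hz], in dB."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2   # zero-padded spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    in_range = (freqs >= lo_hz) & (freqs <= hi_hz)
    in_f0_band = in_range & (freqs >= 0.5 * f0) & (freqs <= 1.5 * f0)
    e_f0 = power[in_f0_band].sum()
    e_rest = power[in_range & ~in_f0_band].sum()
    eps = 1e-12  # guards against log(0) for silent frames
    return 10.0 * np.log10((e_f0 + eps) / (e_rest + eps))
```

A nearly sinusoidal frame gives a clearly positive value, while a frame with strong harmonics gives a negative one.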

a. Ratio of the energy at the second harmonic to the energy at the fundamental frequency (decibels; averaged over the first 50 ms):
   - male speaker, prototypically gentle: e.g. 19.1 or −13.6; prototypically hard: e.g. 1.6 or 3.7
   - female speaker, prototypically gentle: e.g. −15.8 or −15.9; prototypically hard: e.g. 8.9 or 11.2
   - annotated database of patients recorded during exercises: gentle: −13.6 (average value) ±12.1 (standard deviation); hard: −3.2 (average value) ±15.7 (standard deviation)

   It is clear that here a good distinction can be drawn between the gentle and hard voice onsets based on the energy ratio.

b. Ratio of summed energies from 0.5*F₀ to 1.5*F₀ to the rest in the frequency range of 50-4,000 Hz (decibels; obtained over the first 50 ms):
   - male speaker, prototypically gentle: 10.6 or 9.8; prototypically hard: −6.0 or −7.4
   - female speaker, prototypically gentle: 9.3 or 10.9; prototypically hard: 9.4 or −12.0
   - database: gentle: 7.0 (average value) ±9.9 (standard deviation); hard: −2.9 (average value) ±12.4 (standard deviation)

Here again, the differentiation that can be obtained by the method according to the invention from such consideration of the relevant energy ratios is plain to see.

In the method according to the invention, voice onsets must first be recognized in the signal curve of the digital speech signal, that is, identified and localized in the temporal progression of the signal. To this end, the speech signal can advantageously be subdivided into voiced and unvoiced segments. Since the distinction between “voiced” and “unvoiced” from local properties of the speech signal is inherently error-prone, it makes sense to use a method that takes advantage of global consistency conditions so as to arrive at a classification that is as robust as possible. This can be done for example using algorithms for extracting the curve of the fundamental frequency. At this stage, only the “by-product” of the classification into voiced and unvoiced segments is required.

Within the scope of the invention, the RAPT algorithm of David Talkin (cf. D. Talkin, Speech Coding and Synthesis, Elsevier Science, 1995, vol. 495, pp. 495-518, Ch. 14, A Robust Algorithm for Pitch Tracking (RAPT)) is preferably used to this end. Because harmonic multiples are lacking in gentle voice onsets, this algorithm is better suited for precise segmentation than, for example, algorithms that operate in the frequency domain.

In order to minimize false alarms, the segmentation can be smoothed further using morphological operators as described e.g. by H. Niemann in Klassifikation von Mustern (Classification of Patterns), Springer Verlag 1983, 2nd Edition, available at http://www5.cs.fau.de/fileadmin/Persons/NiemannHeinrich/klassifikation-von-mustern/m00-www.pdf. Along with the time of the voice onset, the segmentation also gives the duration of the respective phrase.

The time interval, measured from the time of the voice onset, in which the energy curve with respect to time at the fundamental frequency (= first harmonic) and at one or more harmonic multiples is determined can in particular have a length of 7.5 to 200 ms, preferably of 50 to 100 ms.

A gentle voice onset can be presumed in particular if the time span Δt, within which the energy ratio shifts in favor of the energy at harmonic multiples of the fundamental frequency, is between 50 and 100 ms.
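By way of a non-limiting illustration, the criterion on Δt can be sketched as follows. The sketch takes Δt to be the time from the voice onset to the first analysis frame in which the harmonic energy exceeds the energy at the fundamental frequency; this reading, the function names, the 0 dB threshold, and the 10 ms frame step are assumptions and are not specified in the description.

```python
def shift_time_ms(ratio_db, step_ms=10.0, threshold_db=0.0):
    """Time of the first frame in which the harmonics dominate, or None.

    ratio_db: per-frame harmonic-to-fundamental energy ratio in dB,
    starting at the voice onset (assumed input).
    """
    for i, value in enumerate(ratio_db):
        if value > threshold_db:
            return i * step_ms
    return None

def is_gentle_onset(ratio_db, step_ms=10.0, threshold_db=0.0,
                    min_ms=50.0, max_ms=100.0):
    """Presume a gentle onset if the shift occurs within [min_ms, max_ms]."""
    t = shift_time_ms(ratio_db, step_ms, threshold_db)
    return t is not None and min_ms <= t <= max_ms
```

With such a rule, a ratio that is already positive in the first frame is rejected as a hard onset, whereas a ratio that crosses the threshold only after several frames is accepted.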

Determination of the fundamental frequency, just as in spectral analysis, can take place for example at intervals of 5 to 20 ms, in particular at intervals of 10 ms. Regular determination of the fundamental frequency is important, because it can and usually does change in the course of the voice onset, so the method must consider the correct fundamental frequency for precise analysis.

All in all, this method for detecting voice onsets constitutes a significantly more robust process than methods based purely on the sound level, as is the case in the existing methods. A further advantage is that no calibration is necessary, whereas with methods based on sound level, a threshold value must always be set or estimated.

In the method according to the invention, further relevant acoustic parameters and direct modeling of the target behavior can also be used. This can increase even further the great reliability already achieved by the above-described method.

To this end, various acoustic parameters in a plurality of variants can be derived and calculated from the digital speech signal in order to use all possibly relevant information in the speech signal and thus to achieve maximum robustness and reliability.

The mapping of the resulting multidimensional parameter space onto the classes to be identified (gentle/hard voice onsets) then proceeds in particular with the assistance of data-driven methods by means of an annotated database, typically a collection of voice onsets that were assessed and classified by experts. The method then operates entirely automatically. Further parameter groups that can be considered focus on the temporal development of the sound level (the only source of information considered in the prior art) and on spectrum-dominated parameters that are dedicated in particular to a consideration of the energies at various frequencies and thus indirectly also to voice quality. An exemplary process that also considers such parameters is described in more detail below.

Along with the invention, a data processing program is also provided for the computer-aided automated evaluation of a quality of a voice onset of the speaker, with

a. a voice onset analysis module that is configured to determine a time of the voice onset of the speaker from a temporal progression of a digital speech signal obtained from an acoustic speech signal of the speaker,

b. a fundamental frequency detection module that is configured to obtain a fundamental frequency of the speech signal at the time of voice onset,

c. a fundamental frequency-energy detection module that is configured to obtain from the digital speech signal, in a predetermined time interval from the time of the voice onset, the curve of the energy contained in the speech signal at the fundamental frequency,

d. an overtone-energy detection module that is configured to obtain from the digital speech signal, in the predetermined time interval, the curve of the energy contained in the speech signal for harmonic multiples of the fundamental frequency,

e. a ratio determination module that is configured to determine the temporal progression of the ratio of the energies obtained by the fundamental frequency-energy detection module and the overtone-energy detection module, and to presume a gentle voice onset when in the time interval the ratio of energies initially is dominated by the energy of the fundamental frequency and only in the further course of the predetermined time interval the ratio of energies shifts in favor of energies at harmonic multiples of the fundamental frequency.

With such a data processing program the above-described method can be carried out on computerized data processing devices. The data processing program can in particular and advantageously also contain a digitalization module to produce the digital speech signal from the acoustic speech signal. The voice onset analysis module can advantageously use a RAPT algorithm according to David Talkin, as specified in more detail above, for determining the time of the voice onset. In particular the data processing program can be configured as application software (a so-called app) for a mobile terminal device such as, in particular, a smartphone or a tablet computer.

Furthermore, with the invention a computerized data processing device is provided that contains a data processing program as described above. The data processing program in this case is in particular installed on the data processing device. Advantageously the data processing device can have an input for receiving an acoustic speech signal. Such an input can for example be a microphone. Alternatively, however, a receiving component can also be connected to the data processing device, in which digitalization of the acoustic speech signal already takes place and from which the digital speech signal is then transmitted in high resolution to the data processing device (either by cable or wirelessly, for example via an interface according to the Bluetooth standard). Such a receiving component can for example be formed by the microphone of a headset.

The data processing device can in particular be configured as a mobile terminal device, for example as a smartphone or tablet computer.

The method according to the invention for treating persons with speech disorders, in particular stutterers, is marked by the following steps. A person to be treated is asked to speak a speech sample. This request can in particular be made by means of a computer device, in particular a smartphone or a tablet computer, which is running application software that carries out the method. The speech sample spoken by the person to be treated is recorded as an acoustic speech signal with a recording device, which can in particular be the computer device or can be connected to said device. The recorded acoustic speech signal is converted to a digital speech signal. The digital speech signal is automatically analyzed in its temporal progression, wherein a time of the voice onset of the speaker is determined in the temporal progression of the digital speech signal and a fundamental frequency of the speech signal at the time of the voice onset is obtained. From the digital speech signal, in a predetermined time interval from the time of the voice onset, the curve with respect to time of the energy contained in the speech signal at the fundamental frequency is obtained. Likewise, from the digital speech signal in the predetermined time interval, the curve with respect to time of the energy contained in the speech signal at one or more harmonic multiples of the fundamental frequency is obtained. The temporal progression of the ratio of the energies obtained according to the above steps is determined, wherein a gentle voice onset is presumed if in the time interval the ratio of energies is initially dominated by the energy of the fundamental frequency and only in the further course of the predetermined time interval, in a time span Δt, the ratio of energies shifts in favor of the energy/energies at the harmonic multiples of the fundamental frequency. If according to this analysis of the speech sample the person to be treated is found to have a gentle voice onset, positive feedback is sent to the person to be treated. Otherwise negative feedback is sent to the person to be treated.

This treatment method can in particular be automated and be applied by the person with the speech disorder with the assistance of, for example, a smartphone or tablet computer or some other computerized device with the corresponding recording function and application software for the described signal analysis, without the person to be treated having to see a doctor or some other treatment provider in an office. Thus, the person to be treated, for example during a pause at work, can work through a treatment unit and thus achieve a treatment success and alleviation of the speech disorder. Persons who suffer from speech disorders can then practice much more frequently and complete the treatment units so that therapy success is more rapid and durable. Here the software can contain a treatment plan, for example, which can be successively worked through by the person to be treated. For example, the speech samples can be predetermined sentences and/or words, which are shown on a screen to the person to be treated, who must repeat them. This can also, for example, be individualized and be tailored to the specific fluency disorder of the person to be treated. Thus, for example, a therapist can establish a treatment plan and the corresponding speech samples and the order in which the person to be treated must work through them. It is therapeutically supportive that with mobile terminal devices used in the scope of the treatment it is possible for the first time to use a gentle voice onset in real life outside of the protected therapy space and thus allow the patient to practice there, for in such situations speech disorders and in particular disfluencies are especially pronounced.

In order for the person to be treated to keep track of treatment success, it can be provided that the results of the speech quality analyses are stored and are displayed to the person, for example on a screen, for example in a graphic representation as a “success curve” with respect to time. It can also be provided within the scope of the treatment that a data connection can be made between the treatment device used independently by the person to be treated and a central facility, over which the results of the speech quality analyses can be transmitted to the central facility for success monitoring. This monitoring can then also be carried out by a therapist, who can in turn transmit data with a response, for example instructions for further actions in treatment, to the person to be treated and the device used by that person for treatment. This data connection can for example be formed in particular over the Internet. In particular when broadband data connections and in particular WLAN are available, video telephony can also be used, for example via the corresponding software applications such as Skype. This inclusion of video telephony can markedly simplify the therapeutic monitoring and the possibility of intervention by the therapist in a specific speech situation.

Analysis of the digital speech signal can also be further carried out within the scope of the treatment process in accordance with the above-described advantageous embodiments in the description of the method for evaluating speech quality.

Below, also with reference to the enclosed figures, embodiments of a method according to the invention for evaluation of speech quality, which is also a significant basis for the treatment of persons with speech disorders, are described and explained in more detail.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a voice analysis in the case of a gentle voice onset. At the top, shown as crosses connected by lines, is the curve of the fundamental frequency extracted with the RAPT algorithm of David Talkin, wherein the curve of the fundamental frequency in addition gives the time of the voice onset at around t=1.61 seconds. The middle depiction is a binarized spectrogram (pixels are black if the intensity of a frequency at a respective time is greater than a threshold value, here 25 dB). The fundamental frequency and its harmonic multiples may be identified there as parallel horizontal lines. It can be seen that the lowermost line, which shows the fundamental frequency, is initially isolated and only in the further course (from around t=1.68 seconds) is accompanied by its harmonic multiples. At the bottom the speech signal per se is shown. At the start of the voice onset at around t=1.65 seconds, the speech signal is similar to a simple sine oscillation.

FIG. 2 is a depiction comparable with FIG. 1 with the depictions of the fundamental frequency curve according to the RAPT algorithm (top), spectrogram (center) and speech signal (bottom) for a voice analysis in the case of a hard voice onset. The time of the voice onset here is at around t=5.24 seconds. It can be seen in the spectrogram that here, in contrast to the case of the gentle voice onset in FIG. 1, the harmonic multiples set in practically simultaneously with the fundamental frequency. In addition, it can be seen in the speech signal that the complex (not just sinusoidal) oscillation sets in directly with the onset.

FIGS. 3a-c each show a receiver operating characteristic (ROC) curve for identifying various error types versus gentle onsets, obtained with an automated, software-controlled system for different subsets of features. (The further the curve runs toward the upper left, the better. The main diagonal, on which the false alarm rate equals the hit rate, corresponds to the random baseline.) FIG. 3a shows an assessment made purely from the sound level, as is the case in the prior art. FIG. 3b shows an assessment limited to the analysis, which the invention provides in any case, of the energy ratio of the fundamental frequency to the harmonic multiple or multiples. FIG. 3c shows an analysis in accordance with criteria other than sound level and the energy ratio of the fundamental frequency to the harmonic multiple or multiples; it shows that other parameters in the assessment can also have an (additional) influence on the quality of the analysis. Nonetheless, FIG. 3b shows that the influence of the parameter “energy ratio of fundamental frequency to harmonic multiple/multiples” is especially pronounced and makes possible an especially good evaluation of the quality of the voice onset. The AUC (area under the curve; 0.5 = random baseline, 1 = perfect) is as follows: sound intensity (FIG. 3a): 0.424; 0.466; 0.525; 0.464; 0.403; fundamental frequency/harmonic multiple (FIG. 3b): 0.704; 0.688; 0.466; 0.792; 0.530; remainder (FIG. 3c): 0.634; 0.736; 0.512; 0.518; 0.552.

FIG. 4 shows a receiver operating characteristic curve for identifying various error types versus gentle onsets based on a version of the method according to the invention in which the comparison of the energies at fundamental frequency/harmonic multiple has been supplemented by the assessment of further parameters identifying a gentle voice onset (to be precise, the union of the subsets of features considered in FIGS. 3a-c is used). AUC: 0.725; 0.780; 0.481; 0.675; 0.573.

FIG. 5 is a flowchart of an embodiment of the inventive method for treating a stutterer.

FIG. 6 is a schematic illustration of a computing device configured to treat a stutterer in accordance with the inventive method.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

For the performance of a method according to the invention, in one embodiment variant one can proceed as described below.

1. Segmentation

Initially all voice onsets of an utterance are identified and localized. Toward this end the speech signal is subdivided into voiced and unvoiced segments. The RAPT algorithm of David Talkin is used for this purpose. Here it is assumed that the fundamental frequency is at least 50 Hz and at most 500 Hz. To minimize false alarms, the segmentation is further smoothed with the help of morphological operators (closing with 0.1 seconds, opening with 0.25 seconds, final closing with 0.5 seconds). Along with the time of the voice onset, segmentation also gives the duration of the respective phrase.
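By way of a non-limiting illustration, the morphological smoothing of a frame-wise voiced/unvoiced decision can be sketched as follows. The sketch assumes that the voicing decision is available as one Boolean value per 10 ms frame (for example as a by-product of a pitch tracker), that the stated durations are the lengths of flat structuring elements, and that an odd number of frames is used so that each element is centered; all function names are hypothetical.

```python
import numpy as np

def dilate(mask, width):
    """1-D binary dilation with a centered flat element (width odd)."""
    kernel = np.ones(width, dtype=int)
    return np.convolve(mask.astype(int), kernel, mode="same") > 0

def erode(mask, width):
    """1-D binary erosion, written as the dual of dilation."""
    return ~dilate(~mask, width)

def closing(mask, width):
    return erode(dilate(mask, width), width)   # fills short gaps

def opening(mask, width):
    return dilate(erode(mask, width), width)   # removes short segments

def smooth_voicing(voiced, step_s=0.010):
    """Closing (0.1 s), opening (0.25 s), final closing (0.5 s)."""
    def frames(seconds):
        n = int(round(seconds / step_s))
        return n + 1 if n % 2 == 0 else n      # force an odd width
    m = np.asarray(voiced, dtype=bool)
    m = closing(m, frames(0.1))
    m = opening(m, frames(0.25))
    m = closing(m, frames(0.5))
    return m

def onset_frames(mask):
    """Indices where the mask switches from unvoiced to voiced.

    A segment that is already voiced in frame 0 is not reported.
    """
    return np.flatnonzero(np.diff(mask.astype(int)) == 1) + 1
```

The first closing bridges brief dropouts inside a phrase, the opening discards isolated short detections, and the final closing merges phrases separated by short pauses.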

2. Feature Calculation

2.1. Base Features

Initially the properties of the speech signal are extracted which lateron form the basis of several acoustic parameters. These are:

Sound Level:

For a short-term analysis window (length 40 ms, step size 10 ms, von-Hann window), the instantaneous sound level is calculated over time from the (logarithmic) energy of the signal. Before this, the average value of the signal is subtracted in order to compensate for a possible offset introduced by cheaper and accordingly lower-quality recording hardware.
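By way of a non-limiting illustration, the sound level feature can be sketched as follows. The function name, the decibel scaling of the logarithmic energy, and the subtraction of the mean over the whole signal (rather than per frame) are assumptions of this sketch.

```python
import numpy as np

def sound_level_db(signal, fs, win_s=0.040, step_s=0.010):
    """Short-term log energy (dB) with a von-Hann window, one value per step."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                       # compensate for a recording offset
    win = int(round(win_s * fs))
    step = int(round(step_s * fs))
    window = np.hanning(win)
    levels = []
    for start in range(0, len(x) - win + 1, step):
        frame = x[start:start + win] * window
        levels.append(10.0 * np.log10(np.sum(frame ** 2) + 1e-12))
    return np.array(levels)
```

The small constant inside the logarithm only prevents an undefined value for completely silent frames.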

Pseudo-Period-Synchronous Sound Level:

In order to calculate the sound level of the speech signal as locally as possible (independently of adjacent, possibly significantly quieter or louder oscillation periods), the analysis window is adapted to the fundamental frequency extracted, preferably, with the RAPT algorithm (doubled period duration, von Hann window). If the current fundamental frequency is F₀, then the duration of the analysis window is 2/F₀. At a sampling rate of the speech signal of F_(s) (all speech signals are downsampled to F_(s)=16 kHz), this corresponds to the following number of sampling values: rounded value of 2F_(s)/F₀.

Short-Term Spectrum:

The (logarithmic) power spectral density is calculated from analysis windows with 25 ms width, 10 ms step size, and the von Hann window.

The discrete Fourier transform (DFT) is calculated using the fast Fourier transform (FFT). In order to increase the frequency resolution, zero padding is used so that there are 2048 input values. Standard program libraries are used for the calculation.
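
A minimal sketch of such a short-term spectrum, using NumPy as one possible standard library: each 25 ms frame is weighted with a von Hann window and zero-padded to 2048 values before the FFT. The function name, the decibel scaling, and the floor value are assumptions.

```python
import numpy as np


def short_term_log_psd(signal, fs=16000, win_s=0.025, step_s=0.010,
                       n_fft=2048, floor=1e-12):
    """Log power spectrum per frame; rows are frames, columns are DFT bins 0..n_fft/2."""
    x = np.asarray(signal, dtype=float)
    n = int(round(win_s * fs))            # 400 samples at 16 kHz
    step = int(round(step_s * fs))        # 160 samples at 16 kHz
    window = np.hanning(n)
    frames = np.array([x[s:s + n] * window
                       for s in range(0, len(x) - n + 1, step)])
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)   # rfft zero-pads each frame to n_fft
    return 10.0 * np.log10(np.abs(spectrum) ** 2 + floor)


t = np.arange(1600) / 16000.0
psd = short_term_log_psd(np.sin(2 * np.pi * 250.0 * t))
# With 2048 bins at 16 kHz the bin spacing is 7.8125 Hz, so 250 Hz falls on bin 32.
print(psd.shape, psd.argmax(axis=1))
```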

Pseudo-Period-Synchronous Short-Term Spectrum:

For obtaining a spectrum that is as local as possible, the analysis window is adjusted to the extracted fundamental frequency (fourfold period duration). The duration of the analysis window is 4/F₀, that is, the rounded value of 4F_(s)/F₀ sampling values.
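
The two pseudo-period-synchronous window lengths (two periods for the sound level, four periods for the spectrum) follow from the same formula. A small sketch, in which the function name is hypothetical and Python's built-in rounding is assumed to be an acceptable reading of “rounded value”:

```python
def window_samples(f0_hz, periods, fs_hz=16000):
    """Number of samples spanning `periods` pitch periods: round(periods * Fs / F0)."""
    return int(round(periods * fs_hz / f0_hz))


print(window_samples(100.0, 2))  # 320 samples for the sound level window
print(window_samples(100.0, 4))  # 640 samples for the spectrum window
print(window_samples(220.0, 2))  # 145 samples (145.45 rounded)
```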

Short-Term Spectrum from Normalized Cross Correlation:

In order to calculate the frequency spectrum that exactly corresponds to the information that the preferably used RAPT algorithm uses for extraction of the fundamental frequency, the frequency spectrum is calculated directly from the normalized cross correlation. Here a method for calculating the short-term spectrum from the autocorrelation function is modified in that the normalized cross correlation is used instead of the autocorrelation. Thus, the power spectral density is calculated from the absolute value of the DFT of the cross correlation. This makes possible an especially precise calculation of the energy of the different harmonic multiples of the fundamental frequency. For the cross correlation, as in the RAPT algorithm, the window width is 7.5 ms, and the total width corresponding to the minimal assumed fundamental frequency (50 Hz) is used. In order to increase the frequency resolution, zero padding is done before the DFT, so that there are 2048 input values.
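
One possible reading of this step is sketched below: a normalized cross correlation is computed for all lags up to one period of the lowest assumed fundamental frequency, and the absolute value of its zero-padded DFT serves as the spectrum. The exact normalization, the lag range, the small constant in the denominator, and the function names are assumptions; this is not the RAPT implementation itself.

```python
import numpy as np


def nccf(signal, start, fs=16000, win_s=0.0075, f0_min=50.0):
    """Normalized cross correlation of a 7.5 ms reference window with lagged windows."""
    x = np.asarray(signal, dtype=float)
    n = int(round(win_s * fs))            # 120 samples at 16 kHz
    max_lag = int(round(fs / f0_min))     # 320 samples, longest assumed period
    ref = x[start:start + n]
    e_ref = np.dot(ref, ref)
    values = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        seg = x[start + lag:start + lag + n]
        # the tiny constant only guards against division by zero in silence
        values[lag] = np.dot(ref, seg) / np.sqrt(e_ref * np.dot(seg, seg) + 1e-12)
    return values


def nccf_spectrum(signal, start, n_fft=2048, **kwargs):
    """Absolute value of the zero-padded DFT of the normalized cross correlation."""
    return np.abs(np.fft.rfft(nccf(signal, start, **kwargs), n=n_fft))


t = np.arange(1600) / 16000.0
tone = np.sin(2 * np.pi * 200.0 * t)
spectrum = nccf_spectrum(tone, start=0)
peak_bin = int(spectrum.argmax())         # expected near 200 * 2048 / 16000 = 25.6
```

The caller has to make sure that at least 440 samples (320 lags plus the 120-sample window) are available after `start`.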

2.2. Features

Below are listed the 908 calculated features (acoustic parameters that are used for classification) used for carrying out the method in the variant described here. It should be noted that not necessarily all of the listed features are required for success of the method. According to the invention, however, the focus is at least on properties related to “fundamental frequency/harmonic multiples.” In one variant of the invention (which does not fall under the invention claimed here, but constitutes an independent invention), however, other features can be used and combined without the feature “fundamental frequency/harmonic multiple.”

Other features can be used but do not have to be. It is difficult to determine empirically which out of a given set of further features are actually necessary. Therefore, in pattern recognition it is normal to use many potentially relevant features and to leave it to a data-driven method to carry out the weighting of the individual features. Experience shows that, assuming modern, robust classification methods, this leads to a more reliable classification system than the manual selection of a few promising parameters.

If intervals are mentioned below and time indications are given for them, the time indication “0 ms” relates to the time of the voice onset obtained (with the RAPT algorithm).

Noise:

Calculation of most of the features is to a certain extent influenced by the background noise. In order to give the classification system the possibility of compensating for this, features are used that describe the sound level of the noise. The sound level of the noise is initially estimated as the minimum of the sound level (after taking of the logarithm) smoothed with a 50-ms rectangular window. In order to achieve independence from the gain of the microphone, normalization is carried out with the estimated speech sound level, calculated as the average value before or after taking of the logarithm, or as the median of the sound intensity in voiced segments. (This yields three features.)

Sound Level of the Onset:

The sound level during an interval at the start of onset is used as a feature in different variants (interval lengths: 10 ms, 50 ms, 100 ms, or 200 ms; sound intensity: independent of voiced/unvoiced, only voiced, or pseudo-period-synchronous; three different normalizations as above; this yields 4×3×3=36 features).

Sound Intensity of Direct Onset:

Sound intensity at the time of onset, in different variants (normalized with the three estimated speech sound intensities, or with the sound intensity during the first 50 ms, 100 ms, or 200 ms; this yields six features).

Amplitude:

Sound intensity change during the entire phrase (maximum/minimum, 99%/1% quantile, 95%/5% quantile, 90%/10% quantile; sound intensity: independent of voiced/unvoiced, only voiced, or pseudo-period-synchronous; this yields 4×3=12 features).

Local Amplitude:

Sound intensity change at the start of onset (intervals: 0 to 50 ms, 0 to 100 ms, 0 to 200 ms, 10 to 50 ms, 10 to 100 ms, 10 to 200 ms, 50 to 100 ms, 100 to 200 ms; sound intensity: independent of voiced/unvoiced, only voiced, or pseudo-period-synchronous; this yields 8×3=24 features).

Rise of Sound Intensity:

Slope of the regression line at the start of onset (intervals and sound intensities as above; this yields 24 features).
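
A sketch of such a slope feature, assuming the sound level is available as one value per 10 ms frame with frame 0 located at the voice onset, and assuming that both interval ends are included (neither detail is specified in the text; the function name is hypothetical):

```python
import numpy as np


def level_slope(level_db, start_ms, end_ms, step_ms=10.0):
    """Least-squares slope (dB per ms) of the level curve between two times after onset."""
    first = int(round(start_ms / step_ms))
    last = int(round(end_ms / step_ms))
    y = np.asarray(level_db[first:last + 1], dtype=float)
    t = np.arange(first, first + len(y)) * step_ms   # frame times in ms
    slope, _intercept = np.polyfit(t, y, 1)          # degree-1 fit = regression line
    return slope


rising = 40.0 + 2.0 * np.arange(21)   # toy curve rising by 2 dB per 10 ms frame
slope_0_100 = level_slope(rising, 0, 100)    # about 0.2 dB/ms
slope_50_100 = level_slope(rising, 50, 100)  # about 0.2 dB/ms
```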

Drop of Sound Level:

It is accumulated whether and how strongly the sound level drops locally at the start of onset (a monotonic rise yields 0; intervals: 50 ms, 100 ms, 200 ms; three different sound levels as above; this yields 3×3=9 features).
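
The text does not say exactly how the drops are accumulated. One plausible reading, shown here purely as an assumption, sums all frame-to-frame decreases of the level curve inside the interval, which gives 0 for a monotonic rise as required:

```python
def accumulated_drop(level_db):
    """Sum of all frame-to-frame decreases; 0.0 for a monotonically rising curve."""
    return sum(max(0.0, a - b) for a, b in zip(level_db, level_db[1:]))


print(accumulated_drop([1.0, 2.0, 3.0]))            # 0.0
print(accumulated_drop([1.0, 3.0, 2.0, 4.0, 1.0]))  # 4.0 (drops of 1 and 3)
```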

Fundamental Frequency/Harmonic Multiple:

In order to capture the decisive difference in voice quality between gentle and hard onsets, the relative energy of the fundamental frequency is calculated in different variants (short-term spectrum, pseudo-period-synchronous short-term spectrum, or short-term spectrum from normalized cross correlation; calculation of the energy from a single frequency value (index of the DFT coefficient: rounded value of F₀×2048/F_(s)) or accumulated (symmetrical frequency interval, width=fundamental frequency: the absolute values of the DFT coefficients from the (rounded) index 0.5×F₀×2048/F_(s) to the (rounded) index 1.5×F₀×2048/F_(s) are added before the taking of the logarithm); normalization by the energy of the 2nd harmonic multiple (calculation variant as with the fundamental frequency) or by the total energy in 50 to 4,000 Hz; intervals: 0 to 10 ms, 0 to 50 ms, 0 to 100 ms, 0 to 200 ms, 10 to 50 ms, 10 to 100 ms, 10 to 200 ms, entire phrase, 200 ms to phrase end; accumulation: average or median; this yields 3×2×2×9×2=216 features).
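
The index arithmetic can be made concrete with a short sketch for a single analysis frame. It covers the single-value and the accumulated variant with normalization by the 2nd harmonic multiple. The band around the 2nd harmonic (1.5×F₀ to 2.5×F₀) is an assumption derived from “calculation variant as with the fundamental frequency”; the natural logarithm, the von Hann weighting of the frame, and all names are likewise illustrative.

```python
import numpy as np

N_FFT, FS = 2048, 16000


def band_energy(magnitude, centre_hz, f0_hz, accumulate):
    """Single DFT value at the centre, or sum of absolute values over a band of width F0."""
    if not accumulate:
        return magnitude[int(round(centre_hz * N_FFT / FS))]
    lo = int(round((centre_hz - 0.5 * f0_hz) * N_FFT / FS))
    hi = int(round((centre_hz + 0.5 * f0_hz) * N_FFT / FS))
    return magnitude[lo:hi + 1].sum()      # added before taking the logarithm


def f0_to_h2_log_ratio(frame, f0_hz, accumulate=True):
    """Log ratio of the energy at F0 to the energy at the 2nd harmonic multiple."""
    windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed, n=N_FFT))
    e_f0 = band_energy(magnitude, f0_hz, f0_hz, accumulate)
    e_h2 = band_energy(magnitude, 2.0 * f0_hz, f0_hz, accumulate)
    return float(np.log(e_f0 / e_h2))


t = np.arange(400) / FS                    # one 25 ms frame
f0 = 200.0
fundamental_dominant = np.sin(2 * np.pi * f0 * t) + 0.2 * np.sin(2 * np.pi * 2 * f0 * t)
harmonic_dominant = 0.2 * np.sin(2 * np.pi * f0 * t) + np.sin(2 * np.pi * 2 * f0 * t)
print(f0_to_h2_log_ratio(fundamental_dominant, f0))  # positive: F0 dominates
print(f0_to_h2_log_ratio(harmonic_dominant, f0))     # negative: 2nd harmonic dominates
```

Evaluating this ratio frame by frame after the onset gives a temporal progression of the kind the method examines; averaging or taking the median over the listed intervals would then yield the individual features.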

Fundamental Frequency:

So as to give the classification system the possibility of compensating for influences of the fundamental frequency, the average value and median of the (logarithmic) fundamental frequency are used (this yields 2 features).

Variability of the Sound Level:

So as to characterize creaky voice portions, different measures of the variability of the sound level are calculated (intervals: 0 to 50 ms, 0 to 100 ms, 0 to 200 ms, entire phrase, 200 ms to phrase end; sound level: only voiced or pseudo-period-synchronous; measure: standard deviation, error of the regression line, paired variability index, average absolute acceleration; this yields 5×2×4=40 features; for the sound level independent of voiced/unvoiced, the intervals −100 to 0 ms, 0 to 50 ms, 0 to 100 ms, 0 to 200 ms, entire phrase, and 200 ms to phrase end are used; this yields 6×4=24 features).

Variability Voiced/Voiceless:

Likewise for creaky voice portions, it is characterized how often the voice breaks off, i.e. an unvoiced speech signal is present (proportion of voiced analysis windows over the intervals −100 to 0 ms, 0 to 50 ms, 0 to 100 ms, 0 to 200 ms, entire phrase, 200 ms to phrase end; this yields 6 features).

Harmonicity:

For further characterization of the voice quality, measures of harmonicity are calculated (intervals 0 to 10 ms, 0 to 50 ms, 0 to 100 ms, 0 to 200 ms, entire phrase, 200 ms to phrase end; only counted if voiced; harmonicity from the cross correlation c corresponding to the fundamental frequency: log((c+0.1)/(1.0001−c)); this yields 6 features).
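
The harmonicity transform is a direct formula and can be written down as is. The natural logarithm is an assumption, since the text does not state the base:

```python
import math


def harmonicity(c):
    """log((c + 0.1) / (1.0001 - c)) for a cross-correlation value c in (-0.1, 1.0001)."""
    return math.log((c + 0.1) / (1.0001 - c))


weak = harmonicity(0.5)
strong = harmonicity(0.99)
perfect = harmonicity(1.0)   # stays finite thanks to the 1.0001 in the denominator
```

The constants 0.1 and 1.0001 keep the value finite for c=0 and c=1, and the measure grows monotonically with c.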

Spectral Features 1:

For further characterization of the voice quality or the spoken sound (phone), energies in frequency bands are added (24 mel bands, logarithmic; normalized with the average value (after taking of the logarithm) over voiced segments; averaged over the intervals 0 to 10 ms, 0 to 50 ms, 0 to 100 ms, 0 to 200 ms, 10 to 50 ms, 10 to 100 ms, 10 to 200 ms; this yields 24×7=168 features).

Mel bands are an intermediate result in the calculation of the standard speech recognition features; see below as well as, for instance, http://siggigue.github.io/pyfilterbank/melbank.html. They are (before taking the logarithm) summed-up energies, weighted with triangular filters, which obey an auditory-perceptual scale.
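
A generic triangular mel filterbank is sketched below to make the description concrete. The mel formula used (2595·log10(1+f/700)), the frequency range of 0 to 8,000 Hz, and the unnormalized triangle heights are common conventions chosen here as assumptions; the text does not specify which variant was used.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_bands=24, n_fft=2048, fs=16000, f_lo=0.0, f_hi=8000.0):
    """Matrix of triangular weights, shape (n_bands, n_fft // 2 + 1)."""
    # n_bands + 2 edge frequencies, equally spaced on the mel scale
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_bands + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * fs / n_fft
    bank = np.zeros((n_bands, len(bin_hz)))
    for b in range(n_bands):
        left, centre, right = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        rising = (bin_hz - left) / (centre - left)
        falling = (right - bin_hz) / (right - centre)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_energies(power_spectrum, bank, floor=1e-12):
    """Weighted sums of the power spectrum per band, then the logarithm."""
    return np.log(bank @ power_spectrum + floor)


bank = mel_filterbank()
features = log_mel_energies(np.ones(1025), bank)   # 24 values for a flat spectrum
```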

Spectral Features 2:

For further characterization of the voice quality or the spoken sound, the standard short-term features of a speech recognition system are used (mel-frequency cepstral coefficients, MFCC; see S. B. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980; 13 MFCC from 24 mel bands; normalized with the average value over voiced segments; intervals as above; this yields 13×7=91 features).

Spectral Features 3:

Spectral features 1+2 at the immediate time of the onset; this yields 24+13=37 features.

Sound Level Before Onset:

For modelling of an audible aspiration (which according to the target behavior of fluency shaping should be avoided), the sound level before the onset is characterized (intervals: 100 ms before onset or up to 200 ms, depending on the available time before onset; normalization as for the sound level of the direct onset; this yields 2×6=12 features).

Spectrum Before Onset:

For the same purpose, energies of the mel frequency bands are added before onset (two intervals as above; normalization: all voiced segments, or average value of the first 50, 100, or 200 ms after onset; this yields 2×4×24=192 features).

3. Classification

The mapping of the 908-dimensional feature vector onto the target class (e.g. gentle/hard onset) is implemented using support-vector machines. This data-driven process is distinguished by its excellent performance even with high-dimensional input data. To solve a multi-class problem with this binary classification method, the “one-against-the-rest” scheme is applied. All features are initially normalized to a standard deviation of 1; then the average Euclidean length of a feature vector is normalized to 1. In this way, the metaparameter C can be selected very easily (constant at C=0.001). A linear kernel function is used. For precise assessment of class probabilities, the outputs of the decision functions are suitably transformed.
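
The two normalization steps can be sketched as follows. The sketch assumes that the scaling factors are estimated on the training data and then applied unchanged to test data, and that features are only scaled, not centred (the text mentions no mean subtraction); the function names are hypothetical.

```python
import numpy as np


def fit_normalizer(train_features):
    """Per-feature standard deviations and the resulting average vector length."""
    x = np.asarray(train_features, dtype=float)
    std = x.std(axis=0)
    std[std == 0.0] = 1.0                         # leave constant features untouched
    mean_length = np.linalg.norm(x / std, axis=1).mean()
    return std, mean_length


def apply_normalizer(features, std, mean_length):
    """Scale each feature to unit standard deviation, then all vectors by one factor."""
    return np.asarray(features, dtype=float) / std / mean_length


rng = np.random.default_rng(0)
train = rng.normal(size=(200, 908)) * rng.uniform(0.1, 10.0, size=908)
std, mean_length = fit_normalizer(train)
normalized = apply_normalizer(train, std, mean_length)
print(np.linalg.norm(normalized, axis=1).mean())  # approximately 1.0
```

With vectors of average length 1, the scale of the inputs no longer depends on the number or the units of the features, which is presumably why a single fixed value of C can be used.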

The basis of the support-vector machines is the solution of a non-linear optimization problem; they are distinguished in the application by special robustness in the classification. Input data are the annotation and features of the training speakers. Through the utilized linear kernel function, the classification phase is a simple affine linear function of the features. Standard program libraries are used for the calculation.

The results are transformed for the purpose of estimating probabilities by means of a sigmoid function, see http://scikit-learn.org/stable/modules/calibration.html. Here again standard program libraries are used for calculation.
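
With a linear kernel, the classification phase and the subsequent sigmoid transformation reduce to a few lines. The sketch below uses the common Platt form p = 1/(1+exp(A·f+B)); the weights, the bias, and the parameters A and B shown here are made-up placeholder values, since the trained values are not given in the text.

```python
import math


def decision_value(weights, bias, features):
    """Affine linear decision function of a linear-kernel SVM: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def platt_probability(f, a, b):
    """Sigmoid mapping of a decision value f to a probability; a is typically negative."""
    return 1.0 / (1.0 + math.exp(a * f + b))


# Placeholder parameters for illustration only
f = decision_value([1.0, 2.0], 0.5, [3.0, 4.0])   # 11.5
p_far = platt_probability(f, a=-2.0, b=0.0)       # close to 1
p_border = platt_probability(0.0, a=-2.0, b=0.0)  # exactly 0.5 on the decision boundary
```

In practice A and B would be fitted on held-out decision values, as described in the calibration documentation referenced above.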

4. Data

During regular courses of the KST, speech recordings of stutterers were collected. The speakers were to speak various exercises: sustained vowels, individual syllables, one-, two-, and three-syllable words, and finally, short and long sentences. Affordable headsets were used for digitalization.

There are 3,586 recordings of 16 female and 49 male speakers amounting to a total of 5.0 hours. All speakers are native German speakers.

4.1. Annotation

The material was annotated by five therapists with regard to how well different aspects of the target behavior in fluency shaping were realized in each case. Of interest here are only the annotations about the voice onset, which in each case was assessed as “good” (i.e. gentle), “hard,” “aspirated,” “creaky,” or “afflicted with glottal stop.” The material was presented in random order; per recording, on average there are annotations from 2.8 therapists (i.e. not every therapist assessed all recordings).

4.1.1. Inter-Rater Agreement

To assess how well the criteria were defined (or the difficulty of the annotation task), it was considered how well the annotations of the individual therapists agreed with one another. Paired agreement was assessed, i.e. in each case one therapist was compared with another. Here it should be noted that this is a pessimistic estimate of the quality, since correct answers of the first annotator are in part counted as errors due to erroneous annotation by the second annotator.

Table 1 shows the confusion matrix for vowel onsets. The recognition of the individual error types can also be characterized using hit rates and false alarm rates. These are summarized in Table 2. In sum, it can be said that the criteria appear to be well defined throughout, as the agreement is highly significant. For example, the hit rate for “not gentle versus gentle” is 71.5% with 22.7% false alarms (see Table 2), much higher than the random agreement to be expected here of 22.7%.

TABLE 1 Confusion matrix for annotation of vowel onsets with paired comparison of therapists. Since each of the 1,472 onsets is counted twice, there are a total of 1,638 + 470 + 203 + 565 + 68 = 2,944 = 1,472 × 2 cases.

                            Annotator 2 [%]
Annotator 1    # Cases      gentle    hard    aspirated    creaky    glottal stop
gentle         1,638        77.3      7.1     6.6          6.9       2.1
hard           470          24.7      37.4    2.6          31.1      4.0
aspirated      203          53.2      5.9     28.6         10.8      1.5
creaky         565          20.0      26.0    3.9          49.6      0.5
glottal stop   68           51.5      27.9    4.4          4.4       11.8

TABLE 2 Hit rate/false alarm in % for recognition of different error types at voice onset (vowel onsets, paired comparison of therapists).

not gentle vs. gentle    hard          aspirated     creaky        glottal stop
71.5 @ 22.7              37.4 @ 7.1    28.6 @ 6.6    49.6 @ 6.9    11.8 @ 2.1
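
The values in Table 2 can be reproduced from Table 1 under one reading of the tables, used here as an assumption: annotator 1 is treated as the reference, the hit rate for an error type is the share of its row that annotator 2 labels the same way, and the false alarm rate is the share of the “gentle” row that annotator 2 assigns to that error type.

```python
CASES = {"gentle": 1638, "hard": 470, "aspirated": 203, "creaky": 565, "glottal stop": 68}
LABELS = list(CASES)
# Row = annotator 1, columns = annotator 2 in the order of LABELS, values in percent (Table 1)
CONFUSION = {
    "gentle":       [77.3, 7.1, 6.6, 6.9, 2.1],
    "hard":         [24.7, 37.4, 2.6, 31.1, 4.0],
    "aspirated":    [53.2, 5.9, 28.6, 10.8, 1.5],
    "creaky":       [20.0, 26.0, 3.9, 49.6, 0.5],
    "glottal stop": [51.5, 27.9, 4.4, 4.4, 11.8],
}


def hit_and_false_alarm(error_type):
    """Hit rate and false alarm rate (in %) for one error type."""
    col = LABELS.index(error_type)
    return CONFUSION[error_type][col], CONFUSION["gentle"][col]


def not_gentle_vs_gentle():
    """Pooled hit rate over all error types, and false alarm rate on gentle onsets."""
    errors = [label for label in LABELS if label != "gentle"]
    flagged = sum(CASES[label] * (100.0 - CONFUSION[label][0]) for label in errors)
    hit = flagged / sum(CASES[label] for label in errors)
    false_alarm = 100.0 - CONFUSION["gentle"][0]
    return hit, false_alarm


hit, false_alarm = not_gentle_vs_gentle()
print(round(hit, 1), round(false_alarm, 1))   # 71.5 22.7
print(hit_and_false_alarm("creaky"))          # (49.6, 6.9)
```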

4.2. Acted (Prototypical) Data

In addition, the therapists (six female, one male speaker; all native German speakers) spoke some of the material in two variants, once in normal speech, and once in the connected, gentle speech technique of fluency shaping. Here there are 5,338 recordings amounting to a total of 4.7 hours.

Under the simplified assumption that in this way the corresponding hard and gentle voice onsets were generated, there was no annotation of this part of the material. This was confirmed by some checks on a random basis, at least for voice onsets with vowels. In contrast to the above data from stutterers, where there are many borderline cases, this acted material contains only clear, prototypical realizations of either “gentle” or “hard” voice onsets.

5. Experiment and Results

The reliability to be expected from an automatic system for classification of voice onsets was assessed experimentally. Toward this end, in each case training was done with some of the speakers, and testing with the rest of the speakers (i.e. the parameters of the classification system are estimated using the annotated speakers in the training set, and the accuracy is measured using the annotated speakers in the test set). Altogether, a conservative estimate of the accuracy of the final system is obtained, which is constructed with the data of all speakers. A two-fold cross-validation is used; the results are averaged over 10 runs with different sorting of training and test speakers. During training, an internal cross-validation is used (again speaker-independent, again two folds) for calibration of probabilities. If there are several annotations for one onset, all are used concurrently (both in training and in testing). In this way, the resultant accuracy is directly comparable to the inter-rater agreement from Section 4.1.1 and likewise pessimistic (because in part the errors of therapists are counted as alleged errors of the system).
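
The speaker-independent protocol can be sketched with the standard library alone. The random seed, the equal split into halves, and the function name are assumptions; only the two folds, the ten runs, and the separation by speaker come from the text.

```python
import random


def two_fold_speaker_splits(speakers, runs=10, seed=0):
    """Yield (training speakers, test speakers) pairs; no speaker is ever in both sets."""
    rng = random.Random(seed)
    unique = sorted(set(speakers))
    for _ in range(runs):
        shuffled = unique[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        fold_a, fold_b = set(shuffled[:half]), set(shuffled[half:])
        yield fold_a, fold_b   # train on fold A, test on fold B
        yield fold_b, fold_a   # and the other way round


speakers = [f"speaker{i:02d}" for i in range(65)]   # 16 female + 49 male speakers
splits = list(two_fold_speaker_splits(speakers))
print(len(splits))  # 20 train/test pairs: 10 runs with 2 folds each
```

All recordings and annotations of a speaker would then be assigned to the side on which that speaker appears.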

Only vowel onsets are evaluated. Only recordings are evaluated in which the number of automatically segmented phrases (see Section 1) agrees with the annotation of the therapists. (Reason: for cost efficiency, manual segmentation of the material was dispensed with; by means of the described approach, an approximate, simple assignment of the automatically and manually segmented phrases is possible through the index of the respective phrase.) The classification system provides probabilities for the different voice onsets. Depending on the chosen probability threshold for reporting a pronunciation error, a stricter system (with a higher hit rate but also a higher false alarm rate) or a more cautious system (with a lower hit rate but also fewer false alarms) results. This can be illustrated in a so-called receiver operating characteristic curve (ROC curve).
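
How the threshold traces out the ROC curve can be shown with a small, library-free sketch. The scores below are invented toy values, and the area is computed with the trapezoidal rule; neither is taken from the experiments.

```python
def roc_points(scores, is_error):
    """(false alarm rate, hit rate) for every threshold, from strictest to most lenient."""
    positives = sum(is_error)
    negatives = len(is_error) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        hits = sum(f and e for f, e in zip(flagged, is_error))
        false_alarms = sum(f and not e for f, e in zip(flagged, is_error))
        points.append((false_alarms / negatives, hits / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: estimated error probabilities and whether the onset really was an error
scores = [0.1, 0.4, 0.35, 0.8]
is_error = [False, False, True, True]
points = roc_points(scores, is_error)
print(auc(points))  # 0.75
```

A value of 0.5 corresponds to the random baseline and 1 to perfect separation, matching the interpretation of the AUC values given for FIGS. 3a-c and FIG. 4.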

5.1. Inadequacy of Sound Level

So as to verify the hypothesis that the sound level alone is inadequate as the classification criterion for evaluating voice onsets, in particular for their classification as “gentle,” the reliability of the classification system when using the different subsets of the features from Section 2.2 was examined.

Sound level curve: amplitude, local amplitude, sound level of onset, sound level of direct onset, and sound level rise (102 features); fundamental frequency/harmonic multiple (216 features); remainder: noise, sound level drop, fundamental frequency, variability of sound level, variability of voiced/voiceless, harmonicity, spectral features 1-3, sound level before onset, and spectrum before onset (590 features).
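
The subset sizes can be checked against the per-feature counts given in Section 2.2. The dictionary below merely restates those counts (with the two variability counts of 40 and 24 combined into 64); the grouping keys are labels chosen for illustration.

```python
FEATURE_GROUPS = {
    "sound level curve": {
        "amplitude": 12, "local amplitude": 24, "sound level of onset": 36,
        "sound level of direct onset": 6, "sound level rise": 24,
    },
    "fundamental frequency/harmonic multiple": {
        "fundamental frequency/harmonic multiple": 216,
    },
    "remainder": {
        "noise": 3, "sound level drop": 9, "fundamental frequency": 2,
        "variability of sound level": 40 + 24, "variability voiced/voiceless": 6,
        "harmonicity": 6, "spectral features 1": 168, "spectral features 2": 91,
        "spectral features 3": 37, "sound level before onset": 12,
        "spectrum before onset": 192,
    },
}

totals = {group: sum(counts.values()) for group, counts in FEATURE_GROUPS.items()}
print(totals)                # 102, 216 and 590 features
print(sum(totals.values()))  # 908
```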

The results may be found in FIG. 3 in the illustrations a to c. For features based solely on the sound level curve, the detection performance is not good: for example, the area under the curve (AUC) for “gentle” versus “not gentle,” at 0.424, is even smaller than the value of the random baseline of 0.5.

That this is due not to, say, a different calculation of the sound level parameters, but to the difficulty of the problem, is shown by experiments with the prototypical recordings of the therapists (Section 4.2): if the system is tested on these, the results are very good even with the sound level alone. For example, for “gentle” versus “not gentle” an AUC of 0.906 is obtained (with all features, an even better score of 0.973 results).

Only when the further features are used are good results also obtained on the realistic data of stutterers. It should be noted here (see FIG. 3b) that the analyses based on the fundamental frequency and harmonic multiples, which according to the invention are to be used in any case, are especially suited for identification of creaky onsets, and the remainder (see FIG. 3c) for recognition of hard onsets. It should also be recognized, however, that merely reducing the considered parameters to the energy values, according to the invention, of fundamental frequency and harmonic multiples at voice onset already results in a significant improvement, in comparison with sound-level-only consideration, in the identification performance of a method carried out with these parameters and guidelines. The advantages of the respective parameter groups combine in the system with all features (see FIG. 4).

It could thus be shown that the pattern recognition system recommended with the invention achieves reliable recognition of gentle/not-gentle voice onsets, which can be used for example for stuttering therapy in accordance with fluency shaping. This also applies to the application under realistic conditions. It was shown that previous approaches (as for instance U.S. Pat. No. 4,020,567) do not achieve this (but do so only for clearly pronounced, prototypical data, which are unrealistic for application). The reliability of the system is of the same order of magnitude as that of a therapist. For example, for distinguishing “gentle” vs. “not-gentle,” with 22.7% false alarms there is a hit rate of 58.5% for the system, while a therapist on average achieves 71.5%.

FIG. 5 is a flowchart of an embodiment of the inventive method for treating a stutterer, and FIG. 6 is a schematic illustration of a computing device configured to treat a stutterer. In the method of FIG. 5, a recorded speech sample is input to the analysis steps, which are denoted as steps a through h. These steps have been described above, and the description is not repeated here. At step i, positive feedback is issued to the patient/subject if a gentle voice onset is identified, i.e., as determined in steps f, g, and h. What is not shown in FIG. 5 is that negative feedback is issued in the event that gentle voice onset is not identified, based on the outcome of steps f and g. As shown in FIG. 6, an embodiment of a computing device, such as a smartphone, tablet, or other form of computing device, configured in accordance with the present invention includes a processor, storage, user interface, and audio system including at least a microphone and a speaker. In addition, a number of software-based modules are included. These include a voice onset analysis module, a fundamental frequency module, a fundamental frequency-energy determination module, an overtone energy determination module, and a ratio determination module. These modules include executable instructions which cause the computing device to carry out the inventive method described above.

We claim:
 1. A method for evaluating a quality of a voice onset of a speaker, wherein an acoustic speech signal of the speaker is recorded and converted into a digital speech signal, wherein the digital speech signal is analyzed in its temporal progression in order to a. determine a time of voice onset of the speaker in the temporal progression of the digital speech signal; b. obtain a fundamental frequency of the speech signal at the time of voice onset; c. from the digital speech signal, in a predetermined time interval from the time of voice onset, extract the curve with respect to time of the energy contained in the speech signal at the fundamental frequency; d. from the digital speech signal, in a predetermined time interval from the time of voice onset, extract the curve with respect to time of the energy contained in the speech signal at at least one harmonic multiple of the fundamental frequency; e. determine the time progression of the ratio of the energies obtained in steps c and d; wherein a gentle voice onset is presumed if in the time interval the ratio of energies obtained in step e. is first dominated by the energy of the fundamental frequency, and only further on in the predetermined time interval in a time span of Δt the ratio of energies shifts in favor of the energies of the harmonic multiples of the fundamental frequency.
 2. The method according to claim 1, characterized in that for determination of the time of voice onset in step a., a Robust Algorithm for Pitch Tracking (RAPT) is used.
 3. The method according to claim 1, characterized in that the predetermined time interval has a length of 7.5 to 200 ms.
 4. The method according to claim 3, characterized in that a time span Δt of between 50 and 100 ms is assessed as the criterion for a gentle voice onset.
 5. The method according to claim 1, characterized in that along with an analysis of the ratio of energies obtained in step e., further parameters derived from the digital speech signal are used to determine whether there is a gentle voice onset.
 6. A data processing program stored on a computer readable medium for computerized, automated evaluation of a quality of a voice onset of a speaker, with a. a voice onset analysis module, which is aimed at determining the time of voice onset of the speaker from a temporal progression of a digital speech signal obtained from an acoustic speech signal of the speaker; b. a fundamental frequency module, which is aimed at obtaining a fundamental frequency of the speech signal at the time of voice onset; c. a fundamental frequency-energy determination module, which is aimed at determining the curve with respect to time of the energy contained in the speech signal at the fundamental frequency from the digital speech signal in a predetermined time interval from the time of voice onset; d. an overtone energy determination module, which is aimed at determining the curve with respect to time of the energy contained in the speech signal at harmonic multiples of the fundamental frequency from the digital speech signal in the predetermined time interval; e. a ratio determination module, which is aimed at determining the temporal progression of the ratio of the energies obtained from the fundamental frequency-energy determination module and from the overtone-energy determination module, and to presume a gentle voice onset if in the time interval the ratio of the energies is initially dominated by the energy of the fundamental frequency, and only in the further course of the predetermined time interval the ratio of the energies shifts in favor of the energies of the harmonic multiples of the fundamental frequency.
 7. The data processing program according to claim 6, characterized in that it further has a digitalization module for producing the digital speech signal from the acoustic speech signal.
 8. The data processing program according to one of claim 6 or 7, characterized in that the voice onset analysis module uses a RAPT algorithm for determining the time of voice onset.
 9. The data processing program according to claim 6, characterized in that it is configured as application software for a mobile terminal device, in particular a smartphone or a tablet computer.
 10. A computerized data processing device, characterized in that it contains the data processing program in accordance with claim 6.
 11. The computerized data processing device according to claim 10, characterized in that it has an input for receiving an acoustic speech signal.
 12. The computerized data processing device in accordance with claim 11, characterized in that it is configured as a mobile terminal device.
 13. A method for treating persons with speech disorders, in particular stutterers, with the following steps: a. inviting a person to be treated to provide a speech sample; b. recording of the speech sample provided by the person to be treated as an acoustic speech signal with a recording device; c. conversion of the received speech signal into a digital speech signal; d. analysis of the digital speech signal in its temporal progression, wherein i. in the temporal progression of the digital speech signal, a time of voice onset of the speaker is determined, ii. a fundamental frequency of the speech signal at the time of the voice onset is determined, iii. from the digital speech signal, in a predetermined time interval, the curve with respect to time of the energy contained in the speech signal is determined from the time of voice onset, iv. from the digital speech signal, in the predetermined time interval, the curve with respect to time of the energy contained in the speech signal is determined at at least one harmonic multiple of the fundamental frequency, v. the temporal progression of the ratio of the energies obtained in steps iii. and iv. is determined, vi. wherein a gentle voice onset is presumed if in the time interval the ratio of the energies obtained in accordance with the above step v. is initially dominated by the energy of the fundamental frequency and only in the further course of the predetermined time interval in a time span Δt the ratio of energies shifts in favor of the energy/energies of the harmonic multiple(s) of the fundamental frequency; e. issuance of positive feedback to the person to be treated if a gentle voice onset has been identified and issuance of negative feedback to the person to be treated if no gentle voice onset was identified.
 14. The method according to claim 13, characterized in that a RAPT algorithm is used for determining the time of voice onset in step d.i.
 15. The method according to claim 13, characterized in that the predetermined time interval has a length of from 7.5 to 200 ms, in particular 50 to 100 ms.
 16. The method according to claim 13, characterized in that a time span Δt of between 50 and 100 ms is taken as the criterion for a gentle voice onset.
 17. The method according to claim 13, characterized in that along with an analysis of the energy ratio obtained in step d.v., other parameters derived from the digital speech signal are used in order to determine whether there is a gentle voice onset.
 18. The method according to claim 13, characterized in that the recording device is a smartphone or tablet computer or a recording device connected to such a smartphone or tablet computer, wherein the steps a. and d. to e. are also carried out in the smartphone or tablet computer using the appropriately programmed software, wherein step c. is also carried out in the smartphone or tablet computer or in the recording device connected to the smartphone or tablet computer.
 19. The method according to claim 13, characterized in that the person to be treated is repeatedly invited to provide a speech sample.
 20. The method according to claim 13, characterized in that the results of analyses of these speech samples are recorded and graphically displayed to the person to be treated.
 21. The method according to claim 13, characterized in that the results of the analysis are transmitted to a central evaluating office.
 22. The method according to claim 21, characterized in that the central evaluating office provides responses to the person to be treated regarding treatment progress and/or instructions for a treatment action.
 23. The method according to claim 1, wherein a support-vector machine is used for classification.
 24. The data processing program according to claim 6, further comprising a support-vector machine for classification.
 25. The method according to claim 13, wherein a support-vector machine is used for classification.